Number of rows in the dataset
## [1] 4898
List of variables in the dataset
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Overall summary
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Being very new to wine characteristics, I did some research on the variable names to understand their impact on the taste of wine.
X is the anonymized unique ID of the wine, so let's convert it to a factor.
As our task is to identify the chemical properties which influence the quality, let's look at quality first.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The minimum is 3, the maximum is 9, and the middle 50% of the values lie between 5 and 6.
The distribution of quality looks roughly normal with a peak at 6. The quality values are very concentrated. Let's see what percentage of the sample each value represents.
##
## 3 4 5 6 7 8
## 0.004083299 0.033278889 0.297468354 0.448754594 0.179665169 0.035728869
## 9
## 0.001020825
So around 45% of the wines are rated 6, nearly half of the sample; 6 looks like a very average value. Qualities 3, 4 and 5 together account for around 33%, and qualities 7, 8 and 9 for around 22%. We could use those groups to categorize our wines: 3 to 5 are the low quality wines, 6 the average quality, and 7 to 9 the high quality.
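The grouping described above could be sketched with `cut()`. This is only an illustration: the quality scores below are made up, and the name `quality.group` is my own, not a variable from the report.

```r
# Toy quality scores (made up for illustration; the real data has 4,898 rows)
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# cut() uses right-closed intervals by default: (2,5], (5,6], (6,9]
quality.group <- cut(quality,
                     breaks = c(2, 5, 6, 9),
                     labels = c("low", "average", "high"))

# Share of each group in the toy sample
prop.table(table(quality.group))
```

Because the intervals are closed on the right, a wine rated 5 falls into "low" and a wine rated 6 into "average", which matches the grouping in the text.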
Fixed acidity is given as tartaric acid in the data description. Tartaric acid is one specific molecule. However, this online resource indicates that fixed acidity refers to a class of acids which includes tartaric acid and citric acid.
Let's see some stats first
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Half of the wines have between 6.3 and 7.3. There are some high outliers, up to 14.2. The measurements seem to have a precision of 0.1.
The fixed acidity follows a roughly normal distribution.
From online research, volatile.acidity measures the steam-distillable acids. Note that the US legal limit is 1.1 g/L. I assume that our data are in g/L. It is normally not detectable up to 3 g/L.
Let's get some stats
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Half of the values are between 0.21 and 0.32. Again there are some pretty high outliers, up to 1.1 (which is exactly the US legal limit). Looking at some of the data, the precision seems to be 0.01.
The shape of the volatile acidity distribution approaches a normal distribution. However, there are many small drops in the distribution. Let's use a smaller binwidth of 0.005.
I have the impression that the measuring instrument did not record values consistently: we get many values with 0.0X precision and very few with 0.0X5 precision. I will adopt a 0.01 bin size to smooth the plot.
From my internet search, citric acid contributes to the fixed acidity. It is usually present at between 0 and 0.5 g/L in wine.
Let's get a finer-grained bin size.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Half of the values are between 0.27 and 0.39. Again high outliers are present, with a maximum of 1.66, more than 4 times the third quartile. A binwidth of 0.01 seems suitable.
Most of the citric acid concentrations follow a normal distribution with a peak at 0.3 g/L. Again there are a few outliers at around 1.25 g/L and 1.7 g/L. Note two very non-normal peaks at around 0.5 and 0.75.
Let's look at the exact counts at those 2 strange peaks.
##
## 0.49 0.74
## 215 41
There are 215 wines at 0.49 g/L and 41 wines at 0.74 g/L.
Instinctively, this looks like the result of a carefully controlled addition to the wine. Indeed, citric acid can be used to boost acidity and add “freshness”, but one shouldn't add too much, otherwise it adds a strong citric flavor.
Let's create a categorical variable for those values of citric acid.
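A minimal sketch of such a flag, assuming the values are recorded to 0.01 so that an exact match on 0.49 and 0.74 is meaningful. The concentrations below are made up for illustration.

```r
# Toy citric acid concentrations in g/L (made up for illustration)
wqw <- data.frame(citric.acid = c(0.30, 0.49, 0.32, 0.74, 0.49, 0.27))

# TRUE when the value sits exactly on one of the two suspicious peaks.
# Exact matching is an assumption that relies on the 0.01 recording precision.
wqw$add.citric.acid <- wqw$citric.acid %in% c(0.49, 0.74)

# Number of flagged and unflagged wines
table(wqw$add.citric.acid)
```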
Residual sugar is the sugar that was not transformed during fermentation, in g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Half of the bottles are between 1.7 and 9.9. There seem to be very high outliers again, up to around 65. It seems that 0.1 would be the right binwidth.
Let's see if we can get a normal distribution by taking the square root of the residual sugar.
Well, not very convincing… Let's see with the log10 of the residual sugar.
That seems a bit better: we get a bimodal distribution.
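To compare the two transformations outside of a plot, one rough option is to check how far the mean sits from the median. This is only a sketch on made-up values, and `mean.median.gap` is a helper I introduce here, not part of the original analysis.

```r
# Toy right-skewed residual sugar values in g/L (made up for illustration)
residual.sugar <- c(0.9, 1.2, 1.6, 2.0, 5.0, 8.0, 12.0, 18.0, 65.8)

sugar.sqrt  <- sqrt(residual.sugar)
sugar.log10 <- log10(residual.sugar)

# Crude skewness indicator: distance between mean and median,
# expressed in standard deviations (close to 0 for a symmetric sample)
mean.median.gap <- function(x) (mean(x) - median(x)) / sd(x)

sapply(list(raw = residual.sugar, sqrt = sugar.sqrt, log10 = sugar.log10),
       mean.median.gap)
```

A gap closer to 0 suggests a more symmetric distribution, but it says nothing about bimodality, so the histogram remains the main tool here.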
As described on the Wikipedia page, wines are categorized by sweetness.
It seems that we have a majority of dry wines… let's create a factor variable.
##
## dry medium dry medium sweet
## 0.428133932 0.403225806 0.168436096 0.000204165
The majority of our wines are either dry or medium dry. About a sixth of the bottles (17%) are medium wines. Only one bottle is a sweet wine.
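The factor could be built with `cut()`. The thresholds below (4, 12 and 45 g/L) are my assumption, based on the EU sweetness terms; the original code and its exact breaks are not shown in this report. The sugar values are made up.

```r
# Toy residual sugar values in g/L (made up for illustration)
wqw <- data.frame(residual.sugar = c(1.2, 4.0, 5.2, 12.0, 20.7, 65.8))

# Assumed thresholds: <= 4 dry, <= 12 medium dry, <= 45 medium, > 45 sweet
wqw$sweetness <- cut(wqw$residual.sugar,
                     breaks = c(0, 4, 12, 45, Inf),
                     labels = c("dry", "medium dry", "medium", "sweet"))

# Share of each sweetness category in the toy sample
prop.table(table(wqw$sweetness))
```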
Chlorides represent the amount of salt in the wine, in g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Half of the concentrations are between 0.036 and 0.05. Again there are some very high outliers. It seems like 0.001 would be an appropriate binwidth. Let's also remove the top 1% of the values as outliers.
The graph follows a normal distribution between 0.009 and 0.069. However, we have a kind of long tail from 0.08 up to 0.16.
Let's try without the 3% highest values.
The distribution seems bimodal.
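Dropping the highest values can be done with `quantile()`. The helper `trim.top` and the chloride values below are hypothetical, for illustration only.

```r
# Toy chloride concentrations in g/L (made up for illustration)
chlorides <- c(0.030, 0.036, 0.040, 0.043, 0.045, 0.050, 0.055, 0.090, 0.160, 0.346)

# Keep only the values at or below the chosen upper quantile.
# With the real data the cut-offs would be 0.99 or 0.97; 0.80 is used below
# only because the toy vector is tiny.
trim.top <- function(x, keep = 0.99) x[x <= quantile(x, keep)]

trim.top(chlorides, keep = 0.80)
```

The same idea works inside a plot by limiting the x axis to the quantile instead of filtering the data.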
Free sulfur dioxide represents the free SO2 molecules, in mg/dm3, which work as a preservative. This molecule is easily detectable above 50 ppm.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Half of the values are between 23 and 46. Again there is at least one high outlier, at 289. A binwidth of 1 seems appropriate.
The distribution has a rather flat normal shape.
Total sulfur dioxide is the total amount of SO2 in mg/dm3. It includes the free sulfur dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Half of the values are between 108 and 167. It seems that we get both high and low outliers. A binwidth of 1 seems appropriate.
The data distribution has a lot of noise. A wider bin would attenuate this noise.
One often sees “contains sulfites” on wine bottles because a small share of the population (less than 1%) is sulfite-sensitive. The label must be present when the concentration is higher than 10 ppm. In the US the maximum authorized is 350 ppm. It is also used as a criterion for organic wine, with a maximum of 100 ppm. Read more here.
For liquids, 1 mg/L is approximately 1 ppm, so we can represent those thresholds on the graph.
It seems that almost all our white wines would have to display “Contains Sulfites”. Still, a portion of them could be considered organic. 2 wines of our sample would not be authorized in the US.
Apparently this 10 ppm threshold is more a health issue than anything to do with wine quality, but still, let's create a new variable contains.sulfites with groups built around those thresholds (less than 10, between 10 and 100, and more than 100).
##
## no negligable low normal high
## 0.0000000000 0.0004083299 0.1880359330 0.8111474071 0.0004083299
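Our sample contains no wine in the “no” group, about 0.04% (2 bottles) in each of the “negligable” and “high” groups, about 18.8% in the “low” group and about 81.1% in the “normal” group.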
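A grouping along the thresholds quoted in the text (10, 100 and 350 ppm) could look like the sketch below. It does not reproduce the five levels of the table above, whose exact breaks are not shown in this report; the label names and the values are made up for illustration.

```r
# Toy total sulfur dioxide values in mg/L, treated as ppm (made up for illustration)
wqw <- data.frame(total.sulfur.dioxide = c(9, 85, 134, 167, 440))

# Hypothetical labels built on the thresholds from the text:
# 10 ppm (label required above), 100 ppm (organic maximum), 350 ppm (US maximum)
wqw$sulfites.group <- cut(wqw$total.sulfur.dioxide,
                          breaks = c(0, 10, 100, 350, Inf),
                          labels = c("below label threshold", "organic range",
                                     "normal", "above US limit"))

table(wqw$sulfites.group)
```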
According to the practical winemaker journal, the ratio between free SO2 and total SO2 is key for the preservation of the wine. So let's explore this ratio.
We get a normal distribution of the ratio. Most of the values are contained between 10% and 40%.
The article also mentions that for dry table wines the level of free sulfur is usually somewhere around 40% to 75% of the level of total SO2. Let's cross-check with our sample.
Very few of our dry wines fall within the 40% to 75% range. Most of our wines are below 40%. After reading the article multiple times and double-checking my variables, I cannot figure out why the ratio in our sample is so different.
As this ratio seems important for wine conservation, let's add it as a variable, keeping in mind that we couldn't really validate our values.
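The ratio and the cross-check against the 40% to 75% band can be sketched as follows. The values are made up, and `in.band` is a name I introduce for the illustration.

```r
# Toy sulfur dioxide values in mg/dm3 (made up for illustration)
wqw <- data.frame(free.sulfur.dioxide  = c(23, 34, 46, 30),
                  total.sulfur.dioxide = c(108, 134, 167, 60))

# Ratio of free over total sulfur dioxide
wqw$ratio.sulfur.dioxide <- wqw$free.sulfur.dioxide / wqw$total.sulfur.dioxide

# Flag the wines whose ratio falls in the 40% to 75% band quoted from the article
in.band <- wqw$ratio.sulfur.dioxide >= 0.40 & wqw$ratio.sulfur.dioxide <= 0.75

# Share of wines inside the band
mean(in.band)
```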
Density is the density of the wine, relative to the density of water, which is equal to 1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Half of the values are between 0.9917 and 0.9961. I am not very sure how far out the outliers are. Let's choose a binwidth of 0.0001.
The density distribution seems roughly normal overall but trimodal, with peaks at approximately 0.992, 0.996 and 0.998.
pH is an indicator of how acidic or basic the wine is.
Let's see the stats of the pH.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The mean 3.188 and the median 3.180 are nearly identical. All our white wines are acidic, with values between 2.7 and 3.8. There seems to be a 0.01 precision on the measurements. No real outliers here.
The pH seems to follow a normal distribution.
Sulphates (or potassium sulphate) are a wine additive used as an antimicrobial and antioxidant. They can also be used as fertilizer.
Let's look at the stats.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Half of the values lie between 0.41 g/L and 0.55 g/L. A binwidth of 0.01 seems to fit. No really clear outliers here.
The curve is roughly normal and bimodal. From the table we can find a peak at 0.38 and another at 0.5. We can also spot more clearly some outliers above 1.0 g/L.
Alcohol is quite self-explanatory… it is expressed as a percentage by volume. 11.6% is considered a global average.
Let's look at the stats.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Half of the values are between 9.5% and 11.4%. 0.1 seems like a good binwidth. No real outliers.
The distribution seems rather normal on the whole, but the curve is pretty noisy. It also seems to have 3 different groups: a low alcohol group below 10%, a medium group between 10.5% and 11.5%, and a high alcohol group above 12%.
There are 4,898 white wines in the dataset with 13 variables:
Main observations:
The most important feature is the quality. For the rest of the features, it's not easy at this stage to clearly identify which ones are really important. A good wine is a well-balanced composition that doesn't seem to depend on one particular chemical component.
It is still difficult to identify which features will help, but the density, the alcohol, the sulfur dioxide and the volatile acidity (the vinegar taste) might be more helpful.
I created 3 categorical variables and 1 continuous variable.
The first categorical variable is sweetness. The residual.sugar has been used to categorize the wines.
The second categorical variable is contains.sulfites. It's more a regulatory mark than a taste category, but it could be interesting.
The third categorical variable is add.citric.acid, a boolean to mark the wines with a non-normal concentration of citric acid.
The continuous variable is ratio.sulfur.dioxide, the ratio of free.sulfur.dioxide over total.sulfur.dioxide.
The residual sugar had a kind of long-tailed distribution. After a log10 transformation it became a bimodal distribution. I didn't change the values but will keep this property of the distribution in mind.
According to the matrix, density and residual.sugar have a strong correlation of 0.83. Let's visualize it in a scatterplot.
It looks like a linear relationship.
So we have a strong relationship, and it definitely makes sense. Indeed, the more sugar is dissolved in a liquid, the higher its density.
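The coefficient and the slope of the linear relationship can be obtained with `cor()` and `lm()`. This is a sketch on made-up values, so the numbers it prints are not those of the real dataset.

```r
# Toy values (made up for illustration): sugar in g/L, density relative to water
wqw <- data.frame(residual.sugar = c(1.2, 1.7, 5.2, 9.9, 14.5, 20.7),
                  density        = c(0.9905, 0.9917, 0.9937, 0.9961, 0.9978, 1.0010))

# Pearson correlation coefficient (the default method of cor())
cor(wqw$residual.sugar, wqw$density)

# Slope of the fitted line: change in density per extra g/L of sugar
fit <- lm(density ~ residual.sugar, data = wqw)
coef(fit)["residual.sugar"]
```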
A second strong correlation is between the alcohol and the density, at -0.78. Let's create a scatterplot to explore this relationship.
The alcohol and density seem to follow a linear relationship, which definitely makes sense as the density of alcohol is lower than that of water (which is 1). The higher the alcohol concentration, the lower the density.
A third correlation is a moderate 0.53 between the total sulfur dioxide and the density.
The scatterplot is not very convincing. It looks like a weak relationship.
Between the total sulfur dioxide and the residual sugar, there is a moderate correlation coefficient of 0.47. Let's have a closer look.
It is a bit difficult to get any information from this graph. An additional variable might be useful here.
A moderate positive correlation of 0.43 was spotted in the matrix between the quality and the level of alcohol.
It looks like the good wines of our sample have more alcohol. On average, higher quality wines contain more alcohol than the average wines. Note that the wines of average quality have less alcohol than the worst wines.
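The comparison behind a boxplot like this can also be checked numerically with `tapply()`. The values below are made up for illustration and only mimic the pattern described in the text.

```r
# Toy data (made up for illustration): quality score and alcohol in % by volume
wqw <- data.frame(quality = c(3, 3, 5, 5, 6, 6, 8, 8),
                  alcohol = c(10.2, 10.6, 9.5, 9.9, 10.4, 10.8, 11.4, 12.2))

# Mean and median alcohol for each quality score
mean.alcohol   <- tapply(wqw$alcohol, wqw$quality, mean)
median.alcohol <- tapply(wqw$alcohol, wqw$quality, median)

rbind(mean = mean.alcohol, median = median.alcohol)
```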
Another moderate negative correlation, of -0.45, is between the pH and the fixed acidity.
We clearly see that the higher the fixed acidity, the lower the pH. This makes total sense as a lower pH means more acidic.
The total and the free sulfur dioxide have a correlation coefficient of 0.61. Let's investigate further.
The relationship looks linear, which in a way makes sense as free sulfur dioxide is part of the total sulfur dioxide. Let's now plot the relationship between total minus free vs free.
Not very conclusive: the points show a rather weak correlation.
##
## Pearson's product-moment correlation
##
## data: wqw$total.sulfur.dioxide - wqw$free.sulfur.dioxide and wqw$free.sulfur.dioxide
## t = 19.1158, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2372821 0.2894077
## sample estimates:
## cor
## 0.2635373
Only a weak 0.26 correlation coefficient.
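For reference, a test like the one printed above can be run with `cor.test()`. The sketch below uses made-up values, so its output will differ from the 0.26 reported here; `bound.sulfur.dioxide` is a name I introduce for the difference.

```r
# Toy sulfur dioxide values in mg/dm3 (made up for illustration)
wqw <- data.frame(free.sulfur.dioxide  = c(14, 23, 30, 34, 41, 46, 52, 60),
                  total.sulfur.dioxide = c(120, 108, 97, 134, 186, 140, 167, 150))

# Total minus free sulfur dioxide
bound.sulfur.dioxide <- wqw$total.sulfur.dioxide - wqw$free.sulfur.dioxide

# Pearson's product-moment correlation test (the default method)
result <- cor.test(bound.sulfur.dioxide, wqw$free.sulfur.dioxide)

result$estimate  # correlation coefficient
result$conf.int  # 95 percent confidence interval
```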
I read that the sulfur dioxide ratio influences the pH. Let's check if we get something….
Well, it looks like a correlation of 0… Definitely not related.
Let's compare the pH across the different qualities.
The very best wines have a very controlled, narrow pH range, as opposed to the worst wines, which are more spread out and have a lower (more acidic) pH. There are many more outliers for the average quality wines (5 and 6), but those are the vast majority of our sample. Quality 5 has the lowest mean pH.
Chloride (or salt) is a great taste enhancer; let's see the relationship with quality.
Well, it is hard to tell from this plot whether the best wines have a lower level of chlorides. We can again spot many outliers for qualities 5 and 6. Let's try to get more details.
The better the wine, the lower the chloride level, except for the worst wines (graded 3 and 4). Are those not even worth a bit of chloride?
Too much volatile acidity is supposed to give the wine a vinegar smell. Let's see if the worst wines are the ones with a vinegar smell.
Actually the worst wines (quality 3) don't have the highest level of volatile acidity. However, the wines of quality 4 have the highest average concentration and a few high outliers.
Let's compare density in different quality groups
The best wines (quality 7, 8 and 9) have on average a lower density.
Let's compare total sulfur dioxide according to quality groups
Interesting plot: the better the quality, the narrower the variation of total sulfur dioxide. It's as if the best wine producers are more in control of the sulfur dioxide and don't let it vary much.
Let's see how the wines with those non-normal levels of citric acid are rated in quality.
The non-normal concentrations show up for qualities 4, 5, 6 and 7. For 3, 8 and 9 you cannot spot a peak at 0.49 or 0.74.
Let's look at those peaks by computing the percentage of bottles in each quality which clearly have additional citric acid.
It seems that at least 20% of the best wines (quality 9) are suspected of having added citric acid. The other quality groups are more around 3% to 6%. The small number of quality 9 bottles (0.1% of the sample) might explain this very high percentage. Indeed, a single bottle at 0.49 citric acid is enough to make this 20%.
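The percentage per quality, and the way a tiny group inflates it, can be sketched like this. The data are made up; only the variable names come from the report.

```r
# Toy data (made up for illustration): quality score and citric acid in g/L
wqw <- data.frame(quality     = c(5, 5, 5, 5, 6, 6, 6, 6, 9, 9),
                  citric.acid = c(0.30, 0.49, 0.28, 0.33, 0.74, 0.31, 0.36, 0.29, 0.49, 0.35))

wqw$add.citric.acid <- wqw$citric.acid %in% c(0.49, 0.74)

# The mean of a logical vector is the share of TRUE values,
# so this gives the share of flagged bottles within each quality
share.by.quality <- tapply(wqw$add.citric.acid, wqw$quality, mean)

round(100 * share.by.quality, 1)
```

In this toy sample each quality has exactly one flagged bottle, yet the smallest group ends up with the highest percentage, which is the effect described for quality 9.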
On the one hand, two features have a positive effect on the density: the sugar and the total sulfur dioxide. On the other hand, the alcohol has a negative effect on the density.
The best white wines have a low density and high alcohol. Therefore a wine producer should maximise the fermentation to consume most of the residual sugar and make as much alcohol as possible.
The free and total sulfur dioxide were correlated because the latter contains the former. The difference between the total and the free sulfur dioxide is called bound sulfur dioxide. In our sample the bound and free sulfur dioxide only have a weak (0.26) correlation coefficient.
The variation of total sulfur dioxide and pH across quality seems to tell the story that the producers who make better wine are more in control of the sulfur dioxide and the pH.
The strongest relationship was between the density and the residual sugar, which are strongly positively correlated. A strong negative correlation also exists between the alcohol and the density.
The bivariate plots exposed the relationship between density, residual.sugar and alcohol. Let's get a better feeling for it.
We can clearly see that, for a given residual sugar, the density decreases as the alcohol increases. When the residual sugar increases, the alcohol is lower.
The better wines (7 to 9) have on average a lower residual sugar and a higher alcohol concentration. The worst wines (3 to 5) don't produce a lot of alcohol.
The average wines (6) show both characteristics and are a good subset to represent them.
Let's have another look at total.sulfur.dioxide vs residual.sugar. Maybe adding quality as color would help us identify a pattern.
## Error: Continuous value supplied to discrete scale
Well not really helpful ….
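The error printed above is typical of ggplot2 when a numeric variable is mapped to color while a discrete color scale is used. I don't have the original plotting code, so the sketch below is only one plausible fix, assuming that was the cause: convert quality to a factor before mapping it. The data are made up and `quality.factor` is a name I introduce.

```r
# Toy data (made up for illustration)
wqw <- data.frame(quality              = c(3, 5, 6, 6, 7, 9),
                  residual.sugar       = c(1.2, 5.2, 9.9, 1.7, 2.3, 4.1),
                  total.sulfur.dioxide = c(108, 167, 186, 134, 120, 119))

# A factor is discrete, so it is compatible with a discrete color scale
wqw$quality.factor <- factor(wqw$quality)

# Build the plot only if ggplot2 is installed; printing p would draw it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(wqw, aes(x = residual.sugar, y = total.sulfur.dioxide,
                       color = quality.factor)) +
    geom_point(alpha = 0.5) +
    scale_color_brewer(type = "seq")
}
```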
In continuity with the previous graphs, let's see how our sweetness variable can be used.
I like this plot as it connects to my past experience with different wine sweetness levels.
We rather see a relationship between density and alcohol in the previous plot.
Those last two plots are not really helping us in our exploration. Let's drop the sulfur dioxide and look from the angle of the contains.sulfites variable.
No real trend there: all qualities and pH values are mixed within the different contains.sulfites categories.
The only insight I get from this graph is that the low-sulfite wines seem to have higher alcohol on average. Let's go back to a simple boxplot.
Back to square one with the understanding and visualisation of the sulfur dioxide. I'm a bit clueless. Let's try with pH.
Seems like another dead end.
In the bivariate analysis, the chlorides and the quality had a curved relationship: the higher the quality, the lower the chlorides concentration. Let's draw the quality vs the alcohol with the chlorides as color. The top and bottom 10% of chloride values are removed as outliers.
Interesting view. We can still see that the lower the quality of the wine, the higher the chlorides concentration. The graph gives the feeling that a higher chlorides concentration is associated with a lower alcohol percentage. However, this is just an overall feeling; indeed, quite a few wines with low chlorides have a high alcohol percentage, and the opposite is also true.
Now let's look at the alcohol concentration as a function of the chlorides and total sulfur dioxide variables.
Hmm, lower levels of the 2 components seem to go with a higher percentage of alcohol.
I was looking at “Which chemical properties influence the quality of white wines?”.
From my exploratory data analysis, it appears that the pH, the residual sugar, the density, the chlorides and the alcohol can help us identify a good wine. The lower the residual sugar, the chlorides and the density, and the higher the pH and the alcohol, the better the wine.
The alcohol concentration is a good proxy for the quality of the wine, as it indicates that the fermentation process was well done and very little residual sugar is left in the bottle.
However, a good wine appears to be the right balance of many chemical properties, which prevents me from identifying a linear model.
Regarding citric acid, it seems to be an additive commonly used across all qualities of wine. I would have expected that good quality wines would not rely on such an additive. Also, I still need to find an official confirmation, but the European Union might not allow this additive.
No, I failed to identify or transform my variables to support a linear model.
This plot is a scatterplot of density vs alcohol, colored by wine sweetness category. The sweetness category has been chosen over residual sugar as a more familiar label for the wine consumer. Apart from a single sweet bottle, the data set only contains three sweetness categories: dry, medium dry and medium. A dotted line at a density of 1 marks the water reference point. The points have been made almost transparent to avoid overplotting and keep the linear regression of each sweetness category visible.
One can clearly see that the lines for the dry and medium dry wines are almost parallel. The lines seem to converge outside the graph as alcohol increases. The medium wine line is shorter and a bit more uncertain, but it seems to follow the same pattern and converge with the two others. The medium line is short because our sample only contains about 17% of medium wines against more than 40% each for dry and medium dry wines. This projected convergence makes sense: the more alcohol, the less room for other components, so the densities would ultimately converge.
It is interesting to see that, for a given concentration of alcohol, the dry wines have a lower density than the medium dry wines, which in turn have a lower density than the medium wines. Therefore, when choosing a wine in the store, one could roughly estimate its density from the sweetness category and the percentage of alcohol.
This second plot shows, for each quality, the share of wines with a suspicious concentration of citric acid, at 0.49 g/L and 0.74 g/L. The use of citric acid as an additive in wine is strictly regulated by the EU. Portugal's Vinho Verde region is in Zone C 1, where this additive is only allowed in exceptional years.
The 0.49 g/L and 0.74 g/L concentrations were identified as “suspicious” because the otherwise normal distribution of citric acid shows significant peaks at those values. As citric acid is a taste enhancer, it is usually added to give more freshness.
What is striking is that 1 in 5 bottles of the quality 9 wines has a suspicious concentration of 0.49. For the other qualities, on average 1 bottle in 20 can be considered suspicious. Note also that none of the quality 3 wines have those suspicious concentrations. The peak for quality 9 wines can be explained by the small number of such wines in the dataset: the 20% is only 1 bottle of quality 9. The year of production is missing, so we cannot tell whether the addition of citric acid was legal or not.
This third plot illustrates the relationship between chlorides, total sulfur dioxide and the alcohol concentration. The higher alcohol levels, above 11%, are mainly achieved with a lower concentration of both components. When the concentration of chlorides rises above 0.05 g/L, most of the wines have less than 11% alcohol. The same lower alcohol percentage is seen for a concentration of total sulfur dioxide above 150 mg/L.
Considering that both salt and sulfur dioxide are key components of the fermentation process, it's interesting to see that an excess of either could be associated with a lower alcohol concentration. Indeed, chlorides can improve the fermentation, while without sulfur dioxide there would be no fermentation at all.
The exercises during Udacity lesson 3 were much easier than figuring out a direction without guidance for this project. One has to go step by step. Even with a reasonable number of variables (around 17 here) it was very difficult for me not to get lost. I had to move back and forth in this report to correct wrong conclusions or move plots from the univariate section to the bivariate or trivariate section.
Another big source of struggle was matching the dataset's variables with other information that I could find online. The names sulfates, sulfur and sulfite were a great source of confusion. To add to the naming confusion, some online searches provided very different averages, for example for the ratio.sulfur.dioxide. After reading multiple sources online and coming back to the dataset description, I slowly learnt the different components, but I'm still a bit puzzled by the difference in averages.
One small success was to rediscover something that I already knew (the relationship between sugar, density and alcohol), but the feeling of success mostly came when I selected the right graph for my purpose. I easily got stuck in the analysis with certain types of graph. For example, I couldn't find a way out with scatterplots and histograms until I got the idea of using boxplots, which made a lot of relationships clearer. I also liked adding sweetness as a variable, which helped me connect with the subject.
The next steps for further analyses would be to